Max Pellert (https://mpellert.at)
Deep Learning for the Social Sciences
Introduction of the concept of attention as an additional mechanism, but still using an RNN Encoder–Decoder
First, let’s introduce some terminology (taken from the field of information retrieval)
Consider the problem of choosing which movie to watch in an online movie streaming service:
Associate each movie with a list of attributes describing things such as the genre (comedy, action, etc.), the names of the leading actors, the length of the movie and so on
You as a user could then search through the catalogue to find a movie that matches your preferences
We could encode the attributes of each movie in a vector called the key.
The corresponding movie file itself is called a value
Similarly, you could then provide your own personal vector describing the desired attributes that you care about in a movie, which we call the query
The movie service could then compare the query vector with all the key vectors to find the best match and send the corresponding movie to you in the form of the value file
You were “attending” to the particular movie whose key most closely matches your query
This would be considered a form of hard attention in which a single value vector (= movie in this case) is returned
For what follows about the transformer, we generalize this to soft attention
We use continuous variables to measure the degree of match between queries and keys and we then use these variables to weight the influence of the value vectors
For the attention mechanism in transformers, values, keys and queries refer not to movies, their traits and user preferences but to the tokens in the input sequence
This process is called self-attention because we are using the same sequence to determine the queries, keys, and values
We will briefly talk about variants of this attention mechanism later, for example cross-attention
Also, because the measure of similarity between query and key vectors is given by a dot product, this is known as dot-product self-attention
\[\mathbf{x}^T\mathbf{y} = ||\mathbf{x}|| \cdot ||\mathbf{y}|| \cdot \cos{\theta}\]
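As a quick numerical sanity check of this identity, here is a minimal NumPy sketch with arbitrary example vectors (the values are chosen purely for illustration):

```python
import numpy as np

# Two arbitrary 3-D example vectors
x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 0.0, 1.0])

dot = x @ y  # direct dot product x^T y
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))
# Reconstruct the dot product from the norms and the angle
reconstructed = np.linalg.norm(x) * np.linalg.norm(y) * cos_theta
assert np.isclose(dot, reconstructed)
```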
Scaling is one more refinement that we have to do
The gradients of the softmax function become exponentially small for inputs of high magnitude (resulting in something very close to hard attention)
To prevent this from happening, we can re-scale the product of the query and key vectors before applying the softmax function
There are many possible ways to do that, let’s consider one
If the elements of the query and key vectors were all independent random numbers with zero mean and unit variance, then the variance of the dot product would be \(D_k\)
We can therefore normalize the argument to the softmax using the standard deviation given by the square root of \(D_k\), so the output of the attention layer takes the final form:
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{D_k}}\right)V\]
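A minimal NumPy sketch of scaled dot-product self-attention; the `softmax` and `attention` helpers are illustrative, not from any particular library:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(D_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1: soft attention
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # toy sequence: 4 tokens, dimension 8
out = attention(X, X, X)      # self-attention: Q = K = V derived from the same sequence
```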
The attention layer described so far allows the output vectors to attend to data-dependent patterns of input vectors and is called an attention head
However, there might be multiple patterns of attention that are relevant at the same time
For example, in natural language, some patterns might be relevant to tense whereas others might be associated with vocabulary
A single attention head can lead to averaging over these effects
Instead, let’s use multiple attention heads in parallel
These are identically structured copies of the single head, with independent learnable parameters that govern the calculation of the query, key, and value matrices
Suppose we have \(H\) heads indexed by \(h = 1, \ldots , H\) of the form
\(H_h = \mathrm{Attention}(Q_h, K_h, V_h)\)
The heads are first concatenated into a single matrix, and the result is then linearly transformed using a matrix \(W^{(o)}\) to give a combined output in the form:
\(Y(X) = \mathrm{Concat} [H_1 , \ldots , H_H] W^{(o)}\)
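A sketch of multiple heads in parallel in NumPy, with hypothetical per-head weight matrices drawn at random (in a real model these would be learned):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, head_params, W_o):
    """Concatenate H attention heads and project with W^(o)."""
    heads = []
    for W_q, W_k, W_v in head_params:      # independent parameters per head
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = K.shape[-1]
        A = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(A @ V)                # H_h = Attention(Q_h, K_h, V_h)
    return np.concatenate(heads, axis=-1) @ W_o  # Concat[H_1, ..., H_H] W^(o)

rng = np.random.default_rng(1)
D, H, d = 8, 2, 4                          # model dim, number of heads, per-head dim
head_params = [tuple(rng.normal(size=(D, d)) for _ in range(3)) for _ in range(H)]
W_o = rng.normal(size=(H * d, D))
Y = multi_head_attention(rng.normal(size=(5, D)), head_params, W_o)
```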
To improve training, we can introduce residual connections that bypass the multi-head structure (this is visualized with links in the overview on the transformer architecture)
This is then followed by layer normalization, which further improves training efficiency
To enhance flexibility by introducing more non-linearity, we also add a multilayer perceptron, for example a two-layer fully connected network with ReLU hidden units.
The resulting transformation can be written as:
\(Z = \mathrm{LayerNorm}[Y(X) + X]\)
\(\tilde{X} = \mathrm{LayerNorm}[\mathrm{MLP}(Z) + Z]\)
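Putting the pieces together, a single transformer block can be sketched as follows (single-head attention, no learned LayerNorm gain or bias, and all weight matrices random stand-ins):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(a, eps=1e-5):
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)
    return (a - mu) / (sigma + eps)

def transformer_block(X, W_1, W_2):
    """Self-attention + residual + LayerNorm, then MLP + residual + LayerNorm."""
    d_k = X.shape[-1]
    Y = softmax(X @ X.T / np.sqrt(d_k)) @ X   # single-head self-attention Y(X)
    Z = layer_norm(Y + X)                     # residual connection bypasses attention
    mlp = np.maximum(Z @ W_1, 0.0) @ W_2      # two-layer MLP with ReLU hidden units
    return layer_norm(mlp + Z)                # second residual + LayerNorm

rng = np.random.default_rng(3)
D, D_ff = 8, 16
X = rng.normal(size=(4, D))
out = transformer_block(X, rng.normal(size=(D, D_ff)), rng.normal(size=(D_ff, D)))
```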
Let’s compute layer normalization statistics over all the hidden units in the same layer:
\(H\) denotes the number of hidden units in a layer; under layer normalization, all the hidden units in a layer share the same normalization terms \(\mu\) and \(\sigma\):
\(\mu^{l} = \frac{1}{H} \sum_{i=1}^{H} a_i^{l}\)
\(\sigma^{l} = \sqrt{\frac{1}{H} \sum_{i=1}^{H} ( a_i^{l} - \mu^{l} )^2}\)
We just subtract \(\mu\) and divide by \(\sigma\) so that the activations of the layer have zero mean and unit variance
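A minimal sketch of this normalization in NumPy; a small `eps` is added to the denominator for numerical stability, as is common in practice:

```python
import numpy as np

def layer_norm(a, eps=1e-5):
    """Normalize the H hidden units of a layer to zero mean and unit variance."""
    mu = a.mean(axis=-1, keepdims=True)                             # mu^l
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=-1, keepdims=True))   # sigma^l
    return (a - mu) / (sigma + eps)

a = np.array([[1.0, 2.0, 3.0, 4.0]])  # activations of one layer, H = 4
z = layer_norm(a)
```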
In a typical transformer, several such multi-head attention layers, together with the other components described, are stacked on top of each other
The layers generally have identical structures, although there is no sharing of weights and biases between different layers.
The \(n^{th}\) output \(\mathbf{sa}_n[\mathbf{x}_1,\ldots,\mathbf{x}_N]\) is a weighted sum of all the values \(\mathbf{v}_1,\ldots, \mathbf{v}_N\):
\(\mathbf{sa}_n[\mathbf{x}_1,\ldots,\mathbf{x}_N] = \sum_{m=1}^{N} a[\mathbf{x}_m, \mathbf{x}_n]\mathbf{v}_m\)
Note how the meaning of a sentence depends on token order:
The woman ate the fish
vs.
The fish ate the woman
The man ate the raccoon
vs.
The raccoon ate the man
One last ingredient is still missing
This part is conceptually tough and not easy to understand; it is best to just follow along and accept for now that it works
In transformers, the matrices \(W_h^{(q)}\), \(W_h^{(k)}\) and \(W_h^{(v)}\) are shared across the input tokens, as is the subsequent neural network
From this, it follows that the transformer has the following property:
Permuting the order of the input tokens, i.e. the rows of \(X\), results in the same permutation of the rows of the output matrix \(\tilde{X}\)
This equivariance is actually helpful: it facilitates the massively parallel processing of the transformer and allows the network to learn long-range dependencies just as effectively as short-range ones
But it becomes a major limitation for sequential data, such as the words in a natural language, because the representation learned by a transformer is then independent of the input token ordering
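This equivariance can be checked numerically. A sketch using parameter-free self-attention with \(Q = K = V = X\) (adding the weight matrices \(W_h^{(q)}\), \(W_h^{(k)}\), \(W_h^{(v)}\) would not change the property, since they are shared across tokens):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X):
    d_k = X.shape[-1]
    return softmax(X @ X.T / np.sqrt(d_k)) @ X

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 4))     # 5 toy tokens of dimension 4
perm = rng.permutation(5)

# Permuting the input rows permutes the output rows identically (equivariance)
assert np.allclose(self_attention(X[perm]), self_attention(X)[perm])
```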
We have to encode information on word positions
We aim to encode the token order in the data itself in order not to complicate the architecture further
We will therefore construct a position encoding vector \(r_n\) associated with each input position \(n\) and then combine this with the associated input token embedding \(x_n\)
One obvious way to combine these vectors would be to concatenate them, but this would increase the dimensionality of the input space and hence of all subsequent attention spaces, creating a significant increase in computational cost
Instead, we can simply add the position vectors onto the token vectors to give
\(\tilde{x}_n = x_n + r_n\)
An ideal positional encoding should:
There are many approaches to positional encoding
The original technique by Vaswani et al. (2017) is based on sinusoidal functions
For a given position \(n\), the associated position-encoding vector has components \(r_{ni}\) given by:
\[r_{ni} = \begin{cases} \sin\left(\dfrac{n}{L^{i/D}}\right) & \text{if } i \text{ is even} \\[2mm] \cos\left(\dfrac{n}{L^{(i-1)/D}}\right) & \text{if } i \text{ is odd} \end{cases}\]
where \(D\) is the encoding dimension and \(L\) is a large constant (Vaswani et al. use \(L = 10000\))
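A NumPy sketch of the sinusoidal position encoding, assuming the usual base \(L = 10000\) and an even encoding dimension:

```python
import numpy as np

def sinusoidal_encoding(N, D, L=10000.0):
    """Sinusoidal position encodings in the style of Vaswani et al. (2017)."""
    n = np.arange(N)[:, None]          # positions 0 .. N-1, as a column
    i = np.arange(0, D, 2)[None, :]    # even component indices, as a row
    angles = n / L ** (i / D)          # one frequency per pair of components
    R = np.zeros((N, D))
    R[:, 0::2] = np.sin(angles)        # even components: sine
    R[:, 1::2] = np.cos(angles)        # odd components: cosine
    return R

R = sinusoidal_encoding(N=6, D=8)      # one encoding vector r_n per position
```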